In this paper, a sparsity-aware adaptive algorithm for distributed learning in diffusion networks is developed. The algorithm follows the set-theoretic estimation rationale. At each time instance and at each node of the network, a closed convex set, known as property set, is constructed based on the received measurements; this defines the region in which the solution is searched for. In this paper, the property sets take the form of hyperslabs. The goal is to find a point that belongs to the intersection of these hyperslabs. To this end, sparsity encouraging variable metric projections onto the hyperslabs have been adopted. Moreover, sparsity is also imposed by employing variable metric projections onto weighted $\ell_1$ balls. A combine adapt cooperation strategy is adopted. Under some mild assumptions, the scheme enjoys monotonicity, asymptotic optimality and strong convergence to a point that lies in the consensus subspace. Finally, numerical examples verify the validity of the proposed scheme, compared to other algorithms, which have been developed in the context of sparse adaptive learning.

I. Introduction 

Sparsity, i.e., the presence of only a few non-zero coefficients in a signal/parameter vector to be estimated, has recently attracted overwhelming interest under the Compressed Sensing (CS) framework [1], [2].



However, most of the efforts, so far, have been invested in CS-based signal recovery techniques, which are appropriate 
for batch mode operation. Accordingly, the estimation of the signal parameters can be achieved only after a fixed 
number of measurements has been collected and stored. If a new measurement becomes available, the whole 
estimation process has to be repeated from scratch. As the number of measurements increases, the computational 
burden becomes prohibitive for real time applications. On the contrary, time-adaptive/online updating succeeds in 
improving the current estimate dynamically as new measurements are obtained. Moreover, batch methods are not 
directly suited for time varying scenarios, where the parameter vector changes as time evolves. Online learning
techniques overcome the previously mentioned limitations. Online techniques for sparsity-aware learning have
recently become the focus of intense research activity, e.g., [3]-[5].

In this paper, the task of sparsity-aware learning is treated in the context of distributed processing [6]-[9]. To
be more specific, we consider the typical setup of a distributed network, in which the estimate of the unknown 
parameter vector is based on noisy measurements sensed by a number of spatially distributed nodes. This task can 
be fulfilled following several approaches, with the centralized solution being one of them. In such a scenario, the 
nodes transmit the measured information to a central node, called fusion center, which carries out the full amount 
of computations. Nevertheless, the existence of a fusion center is not always feasible due to power or geographical 
constraints. Furthermore, this approach lacks robustness, since if the fusion center is malfunctioning then the network 
collapses. Hence, in many applications, a decentralized philosophy has to be followed, in which the nodes themselves 
take part in the computation task. The most celebrated examples of such networks are: 

• The incremental, in which each node is able to communicate with only one neighbouring node and, hence, the nodes form a cyclic pattern, e.g., [10], [11]. This topology requires small bandwidth; however, it is not robust, since the network collapses as soon as a single node fails.

• The diffusion, where each node shares information with a subset of nodes. Although the diffusion topology requires larger bandwidth than the incremental one, it is robust to node failures, and its implementation turns out to be easier when large networks are involved [6], [7], [9], [12].

Although there are a few sparsity-aware methods for batch processing in distributed learning, e.g., [13], [14], to the best of our knowledge there is, as yet, no algorithm capable of time-adaptive/online processing in diffusion networks.

The algorithm, to be presented here, handles the requests for sparsity-awareness and operation in diffusion networks, simultaneously. It follows the set-theoretic estimation rationale [15]; that is, instead of seeking a (unique) optimum vector, we search for a set of points that are in agreement with the received set of measurements. To this end, at each time instance, a closed convex set, namely a hyperslab, is defined by the currently received input-output training data pair, and any point that lies within this set is considered to be in agreement with the current measurements. Moreover, following a similar philosophy as in [3], in order to exploit the a-priori knowledge concerning the sparsity of the unknown vector, we constrain the search for a possible solution within sparsity-promoting weighted $\ell_1$ balls. The goal becomes that of finding a point that lies in the intersection of the infinite number of hyperslabs with the previously mentioned constraint sets; this is successfully solved (see, for example, [16]-[18]) by employing a sequence of projections onto the hyperslabs and the weighted $\ell_1$ balls. In the current study, the previous scheme is enhanced by reformulating the projection operators appropriately so as to exploit further the a-priori information with respect to the sparsity of the unknown vector. This can be achieved (see, for example, [19]) by adopting the variable metric projections rationale. As a consequence, the variable metric projections improve the convergence speed when seeking a sparse vector, since different weights are assigned to each coefficient of the updated vector and, through this procedure, small coefficients are forced to diminish faster. The reasoning of assigning different weights to each coefficient is also met in the so-called proportionate algorithms [20], [21].



The paper is organized as follows. In section II the general problem is described and, in the next section, adaptive strategies for estimating sparse signals are provided. In section IV we shed light on basic concepts regarding adaptive distributed learning and in section V the proposed algorithm, together with its theoretical analysis, is discussed. Finally, in section VI the performance of the proposed algorithm is validated, and in the Appendices the theoretical background is discussed and full proofs of the theorems are given.

II. Problem Formulation 

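The set of all real numbers and the set of all non-negative integers are denoted by $\mathbb{R}$ and $\mathbb{Z}_{\ge 0}$, respectively. Given two integers $j_1, j_2$, with $j_1 < j_2$, we define $\overline{j_1, j_2} = \{j_1, \ldots, j_2\}$. The stage of discussion will be the Euclidean space $\mathbb{R}^m$, where $m$ is a positive integer. We denote vectors by boldface letters, e.g., $\boldsymbol{h}$, and matrices with upper-case boldfaced letters. Furthermore, we define the weighted inner product as follows: $\forall \boldsymbol{h}_1, \boldsymbol{h}_2 \in \mathbb{R}^m$, $\langle \boldsymbol{h}_1, \boldsymbol{h}_2 \rangle_{\boldsymbol{V}} := \boldsymbol{h}_1^T \boldsymbol{V} \boldsymbol{h}_2$, and the weighted norm $\forall \boldsymbol{h} \in \mathbb{R}^m$, $\|\boldsymbol{h}\|_{\boldsymbol{V}} = \sqrt{\langle \boldsymbol{h}, \boldsymbol{h} \rangle_{\boldsymbol{V}}}$, where the $m \times m$ matrix $\boldsymbol{V}$ is positive definite, and the notation $(\cdot)^T$ stands for the transposition operator. The Euclidean norm, i.e., $\|\cdot\|$, is a special case of the previously mentioned norm, and occurs if $\boldsymbol{V} = \boldsymbol{I}_m$, where $\boldsymbol{I}_m$ is the $m \times m$ identity matrix. Moreover, the 2-norm of a matrix, say $\boldsymbol{A}$, is denoted by $\|\boldsymbol{A}\|$. Given a vector $\boldsymbol{h} = [h_1, \ldots, h_m]^T \in \mathbb{R}^m$, the $\ell_1$ norm is defined as $\|\boldsymbol{h}\|_1 := \sum_{i=1}^{m} |h_i|$, and the support set as $\mathrm{supp}(\boldsymbol{h}) := \{i \in \overline{1, m} : h_i \neq 0\}$. Finally, the $\ell_0$ "norm" is the cardinality of the support set, i.e., $\|\boldsymbol{h}\|_0 := |\mathrm{supp}(\boldsymbol{h})|$, where, given a set, say $S$, the notation $|S|$ stands for its cardinality.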

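Consider the problem of estimating an unknown parameter vector $\boldsymbol{h}_* \in \mathbb{R}^m$, exploiting measurements $(d_n, \boldsymbol{u}_n)_{n \in \mathbb{Z}_{\ge 0}} \subset \mathbb{R} \times \mathbb{R}^m$, which are related via the linear system

$$d_n = \boldsymbol{u}_n^T \boldsymbol{h}_* + v_n, \quad \forall n \in \mathbb{Z}_{\ge 0}, \tag{1}$$

where $v_n$ is the noise process. We assume that $\boldsymbol{h}_*$ is sparse, i.e., $\|\boldsymbol{h}_*\|_0 \ll m$, or, in other words, it has only a few non-zero coefficients. Suppose that a finite number of measurements, say $N$, is available. In that case, (1) can be written as

$$\boldsymbol{d} = \boldsymbol{U} \boldsymbol{h}_* + \boldsymbol{v},$$

where the regression matrix $\boldsymbol{U} = [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_N]^T \in \mathbb{R}^{N \times m}$, $\boldsymbol{d} = [d_1, \ldots, d_N]^T \in \mathbb{R}^N$, $\boldsymbol{v} = [v_1, \ldots, v_N]^T \in \mathbb{R}^N$,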

and N < m. Classical techniques, as for example the celebrated least-squares method, fail to produce a good 
estimate of the unknown parameters, since the sparsity of h* is not taken into consideration and, consequently, 
there is no guarantee, for a finite number of measurements, that the estimate will predict the support, i.e., the set 
of non-zero components, and force the rest to become zero. This results in an increased misadjustment between the
true and the estimated values [22]. Nevertheless, one can resort to a sparsity promoting technique, namely the Least
Absolute Shrinkage and Selection Operator (Lasso), and overcome the previously mentioned problem. Analytically,
the Lasso estimator promotes sparsity by solving the following optimization task



$$\hat{\boldsymbol{h}} := \arg\min_{\|\boldsymbol{h}\|_1 \le \delta} \|\boldsymbol{d} - \boldsymbol{U}\boldsymbol{h}\|^2,$$



where the term $\|\boldsymbol{d} - \boldsymbol{U}\boldsymbol{h}\|$ accounts for the error residual in the estimation process, and the $\ell_1$ norm promotes
sparsity by shrinking small coefficient values towards zero, e.g., [23]. Most of the emphasis in solving the Lasso
problem has been placed on batch techniques, see, e.g., [24]. However, such techniques are inappropriate for online
learning, where data arrive sequentially and/or the environment is not stationary but it undergoes changes as time 
evolves. 
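The failure of plain least squares for $N < m$ can be illustrated with a small numerical sketch. The dimensions, the noise level, the random seed and the threshold used to count non-zero entries are assumptions chosen only for illustration; they are not taken from the experiments of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for reproducibility only
m, N, s = 64, 32, 4                     # dimension, measurements (N < m), non-zeros

# Sparse unknown vector h_* with s non-zero coefficients.
h_star = np.zeros(m)
support = rng.choice(m, size=s, replace=False)
h_star[support] = rng.standard_normal(s)

# Batch model d = U h_* + v.
U = rng.standard_normal((N, m))
v = 0.1 * rng.standard_normal(N)
d = U @ h_star + v

# Minimum-norm least-squares solution of the underdetermined system.
h_ls, *_ = np.linalg.lstsq(U, d, rcond=None)

# The least-squares estimate spreads energy over (almost) all coefficients,
# so it does not identify the support of h_*.
n_nonzero_ls = int(np.sum(np.abs(h_ls) > 1e-8))
```

In such a run the least-squares estimate is dense, whereas $\boldsymbol{h}_*$ has only `s` non-zero entries, which is the behaviour that motivates sparsity promoting estimators.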

III. Sparsity-aware adaptive algorithms 

Although sparsity promoting adaptive algorithms have drawn the attention of the signal processing community for many years, see, e.g., [20], [21], it is only recently that the topic is being treated in a more theoretically sound framework, within the spirit of $\ell_1$ regularization, e.g., [3]-[5], [25], [26]. The a-priori information concerning the underlying sparsity is provided via a constraint built around the $\ell_1$ norm. By providing this a-priori information, the convergence rate is improved significantly, and the associated error floor in the steady state is reduced as well.

As is often the case, most of these efforts evolve along the three main axes in adaptive filtering. The first is along the gradient descent rationale, as represented in adaptive learning by the LMS [4], [26]. The second direction follows Newton-type arguments, as represented by the RLS [5]. The third route is more recent and builds upon recent extensions of the classical Projections Onto Convex Sets (POCS) theory, which allow for applications in the online time-adaptive setting, e.g., [16]-[18], [27]. Our new algorithm belongs to this last category and it exploits its potential to allow for convex constraints to be efficiently incorporated within the algorithmic flow.

A. Set-theoretic estimation approach and variable metric projections 

In this paper, the set-theoretic estimation rationale, e.g., [15], [18], [28], will be adopted. The philosophy behind this family of algorithms is that, instead of adopting a loss function to be optimized in order to obtain an estimate of the unknown target parameter vector, one obtains an estimate that lies in the intersection of an infinite number of convex sets. Each one of the (convex) property sets is constructed using the information that is provided by the respective measurement pair $(d_n, \boldsymbol{u}_n)$, and basically defines, in turn, a region where the unknown vector lies with high probability, based on the received information and the assumed nature of the noise source. We say that such a convex set is "in agreement" with the received measurement pair. Moreover, in the presence of convex constraints, each of them defines a convex region and the solution is searched in the intersection of all the involved sets, those associated with the measurements as well as those associated with the constraints.

The strategy used in order to achieve the previously mentioned goal of finding a point that lies in the intersection of the infinite number of convex sets was presented in [16]. This algorithmic scheme can be seen as a generalization of the POCS theory [18], [29], [30]. The difference lies in the fact that in the classical POCS theory a finite number of convex sets is involved. On the contrary, in its adaptive version, an infinite number of sets is involved. In the adaptive setting, the task of identifying a point in the intersection of convex sets is accomplished by projecting, in parallel, the currently available estimate onto the $q$ most recently "received" sets. This provides the new estimate. If constraints are present, e.g., [17], further projections are performed, one for each of the constraint sets (the definition of the projection is given in Appendix A). Under some mild assumptions, the estimates converge to a point that lies in the intersection of all the involved convex sets.

It has been pointed out (see, for example, [19]) that the sparsity-related a-priori knowledge can be "embedded"
in the projection operators to the benefit of the algorithm's performance. To this end, the notion of the variable 
metric projection is introduced. The result of a variable metric projection of a certain vector, onto a closed convex 
set (see also Appendix A), is determined by: a) a positive definite matrix, which defines the induced inner product, 
b) the convex set, onto which the projection takes place, c) the vector, which is projected. The difference with the 
classical standard metric projections (Appendix A) is that in the latter the matrix, that defines the weighted inner 
product, is the simplest case of a positive definite matrix, i.e., the identity one. As it will become clear later on, 
for a properly chosen matrix, which is time-dependent and it is constructed via the current estimate at each time 
instance, the variable metric projection pushes small coefficients to diminish faster. In other words, by employing 
at each time instance a different inner product in our Euclidean space, we manage to change the topology of the 
space in order to favour sparse solution vectors. 

In the current paper, the adopted property sets, in which one seeks a candidate solution, take the form of hyperslabs, i.e.,

$$S_n := \{\boldsymbol{h} \in \mathbb{R}^m : |d_n - \boldsymbol{u}_n^T \boldsymbol{h}| \le \epsilon\}, \tag{2}$$

where $\epsilon > 0$ is a user-defined parameter. The parameter $\epsilon$ serves as a threshold and it takes into consideration the noise, as well as possible inaccuracies in the adopted model. In this setting, any point that lies within this hyperslab is in agreement with the current measurement. The choice of a hyperslab, in order to define the property sets, is in line with criteria that have been proposed in the context of the robust statistics rationale, e.g., [18], [31]. The variable metric projection onto the respective hyperslabs is defined as [32]:

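$$\forall \boldsymbol{h} \in \mathbb{R}^m, \quad P_{S_n}^{(\boldsymbol{G}_n)}(\boldsymbol{h}) := \boldsymbol{h} + \beta_n \boldsymbol{G}_n^{-1} \boldsymbol{u}_n, \tag{3}$$

where

$$\beta_n = \begin{cases} \dfrac{d_n - \boldsymbol{u}_n^T \boldsymbol{h} + \epsilon}{\|\boldsymbol{u}_n\|^2_{\boldsymbol{G}_n^{-1}}}, & \text{if } d_n - \boldsymbol{u}_n^T \boldsymbol{h} < -\epsilon, \\ 0, & \text{if } |d_n - \boldsymbol{u}_n^T \boldsymbol{h}| \le \epsilon, \\ \dfrac{d_n - \boldsymbol{u}_n^T \boldsymbol{h} - \epsilon}{\|\boldsymbol{u}_n\|^2_{\boldsymbol{G}_n^{-1}}}, & \text{if } d_n - \boldsymbol{u}_n^T \boldsymbol{h} > \epsilon. \end{cases}$$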



Note that if $\boldsymbol{G}_n = \boldsymbol{I}_m$, then (3) is the standard metric projection onto a hyperslab. The positive definite diagonal
matrix $\boldsymbol{G}_n^{-1}$ is constructed following a similar philosophy as in [19], [21]. The $i$-th coefficient of its diagonal equals
$g_{i,n}^{-1} = \frac{1-\alpha}{m} + \alpha \frac{|h_i^{(n)}|}{\|\boldsymbol{h}_n\|_1}$, where $\alpha \in [0, 1)$ is a parameter that determines the extent to which the sparsity level of
the unknown vector will be taken into consideration, and $h_i^{(n)}$ denotes the $i$-th component of $\boldsymbol{h}_n$. Now, in order to
grasp the reasoning of the variable metric projections, consider the ideal situation, in which $\boldsymbol{G}_n^{-1}$ is generated by the unknown vector $\boldsymbol{h}_*$. It is easy to verify that $g_{i,n}^{-1} > g_{i',n}^{-1}$, if $i \in \mathrm{supp}(\boldsymbol{h}_*)$ and $i' \notin \mathrm{supp}(\boldsymbol{h}_*)$. Hence, employing the variable metric projection, the amplitude of each coefficient of the vector used to construct $\boldsymbol{G}_n^{-1}$ determines the weight that will be assigned to the corresponding coefficient of the second term of the right hand side in (3). That is, components with larger magnitude are weighted more heavily than those of lower magnitude. Loosely speaking, the variable metric projections accelerate the convergence speed when tracking a sparse vector, because assigning different weights makes the coefficients of the estimates with small amplitude diminish faster. The geometric implication is that the projection is made to "lean" towards the direction of the more significant components of the currently available estimate. Obviously, since $\boldsymbol{h}_*$ is unknown, in order to assign the previously mentioned weights, we rely on the available estimate of it, i.e., $\boldsymbol{h}_n$, at each time instance. These concepts are depicted in Fig. 1.

Fig. 1. Illustration of a hyperslab, the standard metric projection of a vector $\boldsymbol{h}$ onto it, denoted by $P_{S_n}(\boldsymbol{h})$, and the variable metric projection onto it.
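The variable metric projection onto a hyperslab can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the diagonal of $\boldsymbol{G}_n^{-1}$ is stored as a vector `g_inv`, and the fallback to uniform weights for an all-zero estimate is an assumption made here to avoid a division by zero.

```python
import numpy as np

def metric_weights(h_est, alpha):
    """Diagonal of G_n^{-1}: (1 - alpha)/m + alpha * |h_i| / ||h||_1."""
    m = h_est.size
    l1 = np.sum(np.abs(h_est))
    if l1 == 0.0:
        # Assumption: uniform weights when the estimate is the zero vector.
        return np.full(m, 1.0 / m)
    return (1.0 - alpha) / m + alpha * np.abs(h_est) / l1

def project_hyperslab(h, u, d, eps, g_inv):
    """Variable metric projection of h onto {x : |d - u^T x| <= eps}."""
    err = d - u @ h
    if abs(err) <= eps:
        return h.copy()                      # h already lies in the hyperslab
    # beta = (err + eps)/||u||^2 if err < -eps, (err - eps)/||u||^2 if err > eps,
    # with the squared norm taken with respect to G^{-1}.
    beta = (err - np.sign(err) * eps) / (u @ (g_inv * u))
    return h + beta * g_inv * u
```

Passing `g_inv = np.ones(m)` recovers the standard metric projection, while weights built from a sparse estimate move the large coefficients more than the small ones.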

Remark 1: The variable metric projections rationale is in line with the so-called proportionate algorithms [20], [21], [33]. At the heart of these algorithms lies the fact that at every time instance different weights are assigned to the coordinates of the vector, which produces the next estimate. ■

As a second step, in order to exploit the sparsity of the unknown vector, sparsity promoting constraints, which take the form of $\ell_1$ balls, are employed. In order to enhance convergence speed, the notion of the weighted $\ell_1$ ball will be adopted [2]. A sparsity-aware adaptive scheme, based on set-theoretic estimation arguments, in which the constraints are weighted $\ell_1$ balls, was presented in [3]. Given a vector of weights $\boldsymbol{w}_n = [w_1^{(n)}, \ldots, w_m^{(n)}]^T$, where $w_i^{(n)} > 0$, $\forall i = 1, \ldots, m$, and a positive radius, $\rho$, the weighted $\ell_1$ ball is defined as: $B_{\ell_1}[\boldsymbol{w}_n, \rho] := \{\boldsymbol{h} \in \mathbb{R}^m : \sum_{i=1}^{m} w_i^{(n)} |h_i| \le \rho\}$. Notice that the classical $\ell_1$ ball occurs if $\boldsymbol{w}_n = \boldsymbol{1}$, where $\boldsymbol{1} \in \mathbb{R}^m$ is the vector of ones. The projection onto $B_{\ell_1}[\boldsymbol{w}_n, \rho]$ is given in [3, Theorem 1], and the geometry of these sets is illustrated in Fig. 2.

It was shown that the estimates of the algorithm proposed in [3] converge asymptotically to a point that lies arbitrarily close to the intersection of the hyperslabs with the weighted $\ell_1$ balls, with the possible exception of a finite number of outliers. In this paper, a generalized version of the algorithm presented in [3] will be developed in the next section.

Remark 2: The weighted $\ell_1$ ball is determined by the vector of weights and the radius. Strategies for constructing the weights have been proposed in [2], [3]. More specifically, $w_i^{(n)} = 1/(|h_i^{(n)}| + \epsilon_n)$, $i = 1, \ldots, m$, where $\epsilon_n$ is a sequence of positive numbers, used in order to avoid divisions by zero. It has been shown, e.g., [3], that by choosing the weights according to the previously mentioned strategy, a necessary condition that guarantees convergence of the algorithm to the unknown parameter is to set $\rho \ge \|\boldsymbol{h}_*\|_0$, since then it holds that $\boldsymbol{h}_* \in B_{\ell_1}[\boldsymbol{w}_n, \rho]$. ■

Fig. 2. Illustration of a weighted $\ell_1$ ball (solid magenta line) and an unweighted $\ell_1$ ball (dashed blue line).

Here we should note that in [3], standard metric projections onto the hyperslabs and the weighted $\ell_1$ balls take place. However, as it will become clear in Appendix C, since we use variable metric projections onto the hyperslabs, the induced inner product, which will be used in the analysis of the algorithm, is time varying and it is determined by the matrix $\boldsymbol{G}_n$. This fact forces us to employ variable metric projections onto the respective $\ell_1$ balls too.

Claim 1: Recall the definition of the diagonal matrix $\boldsymbol{G}_n$. The variable metric projection onto $B_{\ell_1}[\boldsymbol{w}_n, \rho]$ is given by $P^{(\boldsymbol{G}_n)}_{B_{\ell_1}[\boldsymbol{w}_n, \rho]} = \boldsymbol{G}_n^{-1/2} P_{B_{\ell_1}[\boldsymbol{G}_n^{-1/2} \boldsymbol{w}_n, \rho]} \boldsymbol{G}_n^{1/2}$.

Proof: The proof is given in Appendix B. ■ 
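Claim 1 reduces the variable metric projection to a standard one, which the following sketch makes concrete. The standard projection is computed here with a generic sort-and-threshold routine for weighted $\ell_1$ balls; it is offered as an assumption-level stand-in and is not claimed to reproduce the exact statement of [3, Theorem 1]. The function names are hypothetical, and `g` holds the diagonal of $\boldsymbol{G}_n$ (the reciprocals of the weights $g_{i,n}^{-1}$).

```python
import numpy as np

def project_weighted_l1_ball(h, w, rho):
    """Euclidean projection of h onto {x : sum_i w_i |x_i| <= rho}, w_i > 0."""
    a = np.abs(h)
    if w @ a <= rho:
        return h.copy()                        # already inside the ball
    ratio = a / w
    order = np.argsort(-ratio)                 # decreasing |h_i| / w_i
    cum_wa = np.cumsum(w[order] * a[order])
    cum_ww = np.cumsum(w[order] ** 2)
    lam_k = (cum_wa - rho) / cum_ww            # candidate soft thresholds
    # Largest active set whose smallest ratio still exceeds its threshold.
    k = np.nonzero(ratio[order] > lam_k)[0][-1]
    lam = lam_k[k]
    return np.sign(h) * np.maximum(a - lam * w, 0.0)

def project_weighted_l1_ball_metric(h, w, rho, g):
    """Claim 1: G^{-1/2} P_{B[G^{-1/2} w, rho]}(G^{1/2} h), with g = diag(G) > 0."""
    s = np.sqrt(g)
    return project_weighted_l1_ball(s * h, w / s, rho) / s
```

The wrapper only rescales the input, the weights and the output, so the extra cost over the standard projection is a few element-wise operations.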

IV. Adaptive distributed learning 

We now come to the main point of this paper. Our task is to estimate the sparse, unknown parameter vector $\boldsymbol{h}_* \in \mathbb{R}^m$, exploiting measurements collected at the $K$ nodes of a network obeying the diffusion topology. An example of such a network is illustrated in Fig. 3. The node set is denoted by $\mathcal{N} := \{1, \ldots, K\}$ and we assume that each node is able to communicate, i.e., to exchange information, with a subset of $\mathcal{N}$, namely $\mathcal{N}_k$, $k = 1, \ldots, K$. This set, hereafter, will be called the neighbourhood of $k$. Moreover, each node has access to the measurement pair $(d_{k,n}, \boldsymbol{u}_{k,n})_{n \in \mathbb{Z}_{\ge 0}}$, $k \in \mathcal{N}$, where $\boldsymbol{u}_{k,n} \in \mathbb{R}^m$ and $d_{k,n} \in \mathbb{R}$, and the measurements are related according to $d_{k,n} = \boldsymbol{u}_{k,n}^T \boldsymbol{h}_* + v_{k,n}$, where $v_{k,n}$ stands for the additive noise at each node. In a nutshell, what differentiates adaptive distributed learning from its classical adaptive counterpart is the fact that, in the former case, each node, besides the locally received measurement pair, also exploits information received from its neighbouring nodes. For a fixed node, say $k$, and at every time instance, this extra information comprises the estimates of the unknown vector, which have been obtained, at the previous time instance, from the nodes with which communication is possible, i.e., $l \in \mathcal{N}_k$. The use of this extra information results in a faster convergence speed, as well as a lower steady state error floor, compared to the case where the measurement pair is solely used, e.g., [6], [7]. One more objective, which makes the exchange of the estimates crucial, is that the distributed "nature" of our problem imposes the need for consensus; this means that the nodes will have to converge to the same estimate. It has been shown that this information exchange can lead asymptotically to consensus [7], [8], [34], [35].

Fig. 3. Illustration of a diffusion network with $K = 7$ nodes.

Depending on the way with which the estimates are exploited, the following cooperation strategies have been 
proposed: 

• Combine Adapt, in which, at every node, the estimates from the neighbourhood are fused under a certain protocol, and then the aggregate is put into the adaptation step [6], [8], [36].

• Adapt Combine, where the adaptation step precedes the combination step [9], [35].

• Consensus based, where the computations are made in parallel and there is no clear distinction between the combine and the adapt step [7], [34].

Now, let us shed light on the combination of the estimates coming from the neighbourhood of each node. Recall the previous discussion; an arbitrary node, $k$, is able to communicate with every node that belongs to $\mathcal{N}_k$. We assume that the following hold true: $k \in \mathcal{N}_k$, $\forall k \in \mathcal{N}$, and $l \in \mathcal{N}_k \Leftrightarrow k \in \mathcal{N}_l$, $\forall k, l \in \mathcal{N}$. Moreover, we consider that the network is strongly connected, i.e., there is a, possibly multihop, path that connects every two nodes of the network. These assumptions are very common in adaptive distributed learning (see, for example, [6], [7]). As stated earlier, the estimates received from the neighbourhood are fused under a certain protocol. The most common strategy is to take a linear combination of the estimates. To be more specific, we define the combination coefficients, for which we have that $c_{k,l}(n) > 0$ if $l \in \mathcal{N}_k$, $c_{k,l}(n) = 0$ if $l \notin \mathcal{N}_k$, and $\sum_{l \in \mathcal{N}_k} c_{k,l}(n) = 1$. From the previous definition, it can be readily seen that every node assigns a weight to each one of the estimates which are received from the neighbourhood. Two well known examples of combination coefficients are: the Metropolis rule, where



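$$c_{k,l}(n) = \begin{cases} \dfrac{1}{\max\{|\mathcal{N}_k|, |\mathcal{N}_l|\}}, & \text{if } l \in \mathcal{N}_k \text{ and } l \neq k, \\ 1 - \sum_{i \in \mathcal{N}_k \setminus \{k\}} c_{k,i}(n), & \text{if } l = k, \\ 0, & \text{otherwise,} \end{cases}$$

and the uniform rule, in which the coefficients are defined as

$$c_{k,l}(n) = \begin{cases} \dfrac{1}{|\mathcal{N}_k|}, & \text{if } l \in \mathcal{N}_k, \\ 0, & \text{otherwise.} \end{cases}$$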



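Collecting all the coefficients for a network, we define the combination matrix $\boldsymbol{C}_n$, in which the $k,l$-th component is $c_{k,l}(n)$. This matrix gives us information about the network's topology: if the $k,l$-th entry is equal to zero, the nodes $k, l$ are not connected. The opposite also holds true, since a positive coefficient implies that the nodes are connected. Finally, we define the $Km \times Km$ consensus matrix, $\boldsymbol{P}_n = \boldsymbol{C}_n \otimes \boldsymbol{I}_m$, where the symbol $\otimes$ stands for the Kronecker product. Some very useful properties of this matrix are [35]:

1) $\|\boldsymbol{P}_n\| = 1$.

2) Any consensus matrix $\boldsymbol{P}_n$ can be decomposed as

$$\boldsymbol{P}_n = \boldsymbol{X}_n + \boldsymbol{B}\boldsymbol{B}^T,$$

where $\boldsymbol{B} = [\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m]$ is a $Km \times m$ matrix, $\boldsymbol{b}_k = (\boldsymbol{1}_K \otimes \boldsymbol{e}_k)/\sqrt{K}$, $\boldsymbol{1}_K \in \mathbb{R}^K$ is the vector of ones, $\boldsymbol{e}_k$ is an $m \times 1$ vector of zeros except the $k$-th entry, which is one, and $\boldsymbol{X}_n$ is a $Km \times Km$ matrix for which it holds that $\|\boldsymbol{X}_n\| < 1$.

3) $\boldsymbol{P}_n \underline{\boldsymbol{h}} = \underline{\boldsymbol{h}}$, $\forall \underline{\boldsymbol{h}} \in \mathcal{O} := \{\underline{\boldsymbol{h}} \in \mathbb{R}^{Km} : \underline{\boldsymbol{h}} = [\boldsymbol{h}^T, \ldots, \boldsymbol{h}^T]^T, \boldsymbol{h} \in \mathbb{R}^m\}$. The subspace $\mathcal{O}$ is the so-called consensus subspace of dimension $m$, and $\boldsymbol{b}_k$, $k = 1, \ldots, m$, constitute a basis for this set. Hence, the orthogonal projection of a vector, $\underline{\boldsymbol{h}}$, onto this linear subspace is given by $P_{\mathcal{O}}(\underline{\boldsymbol{h}}) := \boldsymbol{B}\boldsymbol{B}^T \underline{\boldsymbol{h}}$, $\forall \underline{\boldsymbol{h}} \in \mathbb{R}^{Km}$.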
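The combination matrix, the consensus matrix and the basis of the consensus subspace can be built numerically as sketched below. The helper names and the adjacency-matrix input format (symmetric, 0/1 entries, no self-loops) are assumptions of this sketch; the neighbourhood size $|\mathcal{N}_k|$ counts node $k$ itself, in line with $k \in \mathcal{N}_k$.

```python
import numpy as np

def metropolis_matrix(adj):
    """Metropolis combination matrix from a symmetric 0/1 adjacency matrix."""
    K = adj.shape[0]
    size = adj.sum(axis=1) + 1                 # |N_k|, node k included
    C = np.zeros((K, K))
    for k in range(K):
        for l in np.nonzero(adj[k])[0]:
            C[k, l] = 1.0 / max(size[k], size[l])
        C[k, k] = 1.0 - C[k].sum()             # remaining mass stays at node k
    return C

def consensus_matrix(C, m):
    """P = C (Kronecker product) I_m."""
    return np.kron(C, np.eye(m))

def consensus_basis(K, m):
    """B = [b_1, ..., b_m] with b_k = (1_K kron e_k) / sqrt(K)."""
    return np.kron(np.ones((K, 1)), np.eye(m)) / np.sqrt(K)
```

For a stacked vector $\underline{\boldsymbol{h}}$, the product `B @ B.T @ h` replaces every node's block by the network average, which is the orthogonal projection onto the consensus subspace.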

V. Proposed Algorithmic Scheme 

The goal is to bring together the sparsity promoting "tools", which were discussed in section III, and to
reformulate them in a distributed fashion by adopting the combine adapt strategy, which was presented in the
previous section. The main steps of the algorithm, for each node k, at time instance n, in order to produce the next
estimate, can be summarized as follows:

Algorithm: 

1) The estimates from the neighbourhood are received and combined with respect to the adopted combination strategy, in order to produce $\boldsymbol{\phi}_{k,n} = \sum_{l \in \mathcal{N}_k} c_{k,l}(n) \boldsymbol{h}_{l,n}$, $\forall k \in \mathcal{N}$.

2) Exploiting the newly received measurements $d_{k,n}, \boldsymbol{u}_{k,n}$, the following hyperslab is defined: $S_{k,n} = \{\boldsymbol{h} \in \mathbb{R}^m : |d_{k,n} - \boldsymbol{u}_{k,n}^T \boldsymbol{h}| \le \epsilon_k\}$, where the parameter $\epsilon_k$ is allowed to vary from node to node. The aggregate $\boldsymbol{\phi}_{k,n}$ is projected, using variable metric projections, onto the $q$ most recent hyperslabs, constructed locally, and a convex combination of them is computed. Analytically, the sliding window $\mathcal{J}_n := \overline{\max\{0, n - q + 1\}, n}$ is defined, and it determines the hyperslabs that will be considered at time instance $n$. Given the set of weights $\omega_{k,j}$, $\forall j \in \mathcal{J}_n$, where $\sum_{j \in \mathcal{J}_n} \omega_{k,j} = 1$, $\forall k \in \mathcal{N}$, the convex combination of the projections onto the hyperslabs, i.e., $\sum_{j \in \mathcal{J}_n} \omega_{k,j} P^{(\boldsymbol{G}_n)}_{S_{k,j}}(\boldsymbol{\phi}_{k,n})$, is computed. The effect of projecting onto $q > 1$ hyperslabs is to speed up convergence [3].

3) The result of the previous step is projected onto the sparsity constraint set, i.e., the weighted $\ell_1$ ball.
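The previous steps can be encoded in the following mathematical formula:

$$\boldsymbol{h}_{k,n+1} = P^{(\boldsymbol{G}_n)}_{B_{\ell_1}[\boldsymbol{w}_n, \rho]} \Big( \boldsymbol{\phi}_{k,n} + \mu_{k,n} \Big( \sum_{j \in \mathcal{J}_n} \omega_{k,j} P^{(\boldsymbol{G}_n)}_{S_{k,j}}(\boldsymbol{\phi}_{k,n}) - \boldsymbol{\phi}_{k,n} \Big) \Big), \tag{4}$$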



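where $\mu_{k,n} \in (0, 2\mathcal{M}_{k,n})$, and

$$\mathcal{M}_{k,n} := \begin{cases} \dfrac{\sum_{j \in \mathcal{J}_n} \omega_{k,j} \big\| P^{(\boldsymbol{G}_n)}_{S_{k,j}}(\boldsymbol{\phi}_{k,n}) - \boldsymbol{\phi}_{k,n} \big\|^2_{\boldsymbol{G}_n}}{\big\| \sum_{j \in \mathcal{J}_n} \omega_{k,j} P^{(\boldsymbol{G}_n)}_{S_{k,j}}(\boldsymbol{\phi}_{k,n}) - \boldsymbol{\phi}_{k,n} \big\|^2_{\boldsymbol{G}_n}}, & \text{if } \sum_{j \in \mathcal{J}_n} \omega_{k,j} P^{(\boldsymbol{G}_n)}_{S_{k,j}}(\boldsymbol{\phi}_{k,n}) \neq \boldsymbol{\phi}_{k,n}, \\ 1, & \text{otherwise.} \end{cases} \tag{5}$$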



The algorithm has an elegant geometrical interpretation, which can be seen in Fig. 4. It turns out that the weighted $\ell_1$ ball, as well as $\boldsymbol{G}_n$, have to be the same for every node of the network, which implies that this information cannot be constructed locally. This fact, as it will be established in the theoretical analysis of the algorithm, is essential in order to guarantee consensus. Hence, a reasonable strategy, which will be adopted here, is to construct $\boldsymbol{w}_n$ and $\boldsymbol{G}_n$, using the methodology described in section III, via $\boldsymbol{h}_{k_{opt},n}$, where $k_{opt}$ is the node with the smallest noise variance. It is obvious that this requires knowledge, in every node, of $\boldsymbol{h}_{k_{opt},n}$, something that is in general infeasible. However, it is not essential to update the parameters at every time instance; instead, $\boldsymbol{w}_n$ and $\boldsymbol{G}_n$ can be updated every, say, $n' > 1$ time instances, where $n'$ is the number of time steps required for $\boldsymbol{h}_{k_{opt},n}$ to be distributed over the network. Experiments regarding the robustness of the proposed algorithm with respect to $n'$ are given in the Numerical Examples section. Moreover, as it will become clear in the Numerical Examples section, it turns out that the algorithm is robust in cases where the knowledge of the less noisy node is not available, and/or in cases where the assumption that these quantities must be common to all nodes is violated and each node uses the locally available values.
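One iteration of the combine adapt recursion in (4) can be sketched as follows, assuming a standard extrapolation parameter of the kind used in adaptive projection schemes for (5). The function name and argument layout are hypothetical. The sketch further assumes uniform weights $\omega_{k,j} = 1/q$, a common hyperslab width for all nodes, and a step size chosen as a fixed fraction of $\mathcal{M}_{k,n}$; the final projection onto the weighted $\ell_1$ ball is passed in as a callable and defaults to the identity, so that the block stays self-contained.

```python
import numpy as np

def combine_adapt_step(H, C, U_win, d_win, eps, g_inv, mu_scale=1.0,
                       project_constraint=lambda x: x):
    """One network-wide iteration of the combine adapt recursion.

    H        : (K, m) current estimates h_{k,n}
    C        : (K, K) combination matrix
    U_win    : (K, q, m) the q most recent regressors of every node
    d_win    : (K, q) the matching measurements
    eps      : hyperslab half-width (assumed equal for all nodes)
    g_inv    : (m,) diagonal of G_n^{-1}, shared by all nodes
    mu_scale : mu_{k,n} = mu_scale * M_{k,n}, with mu_scale in (0, 2)
    """
    K, q, m = U_win.shape
    Phi = C @ H                                      # step 1: combine
    H_next = np.empty_like(H)
    omega = np.full(q, 1.0 / q)                      # assumed uniform weights
    for k in range(K):
        phi = Phi[k]
        err = d_win[k] - U_win[k] @ phi              # residuals, shape (q,)
        GU = U_win[k] * g_inv                        # rows: G^{-1} u_{k,j}
        denom = np.einsum("jm,jm->j", U_win[k], GU)  # u^T G^{-1} u
        beta = np.sign(err) * np.maximum(np.abs(err) - eps, 0.0) / denom
        steps = beta[:, None] * GU                   # P_j(phi) - phi
        avg = omega @ steps                          # convex combination
        den = avg @ (avg / g_inv)                    # squared G-norm of avg
        if den > 0.0:
            num = omega @ np.einsum("jm,jm->j", steps, steps / g_inv)
            M = num / den                            # extrapolation parameter
        else:
            M = 1.0
        # step 2: adapt, step 3: constraint projection
        H_next[k] = project_constraint(phi + mu_scale * M * avg)
    return H_next
```

In a full implementation the `project_constraint` argument would be the variable metric projection onto the weighted $\ell_1$ ball, with `g_inv` and the ball weights refreshed only every few iterations, as discussed above.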

Regarding the complexity of the algorithm, it has been shown in [3] that, if standard metric projections take place, then the complexity of the respective algorithm is $O(qm)$, coming from the projection operators, and $O(m \log_2 m)$, occurring from the projection onto the weighted $\ell_1$ ball. If we employ the variable metric projections, at each node, it is obvious that the term $\boldsymbol{G}_n^{-1} \boldsymbol{u}_{k,j}$, $j \in \mathcal{J}_n$, has to be computed, and this adds $qm$ multiplication operations.

Remark 3: The algorithm presented in [3] is a special case of the scheme in (4) if $K = 1$ and $\boldsymbol{G}_n = \boldsymbol{I}_m$. The same also holds for the IPNLMS [21] if we let $K = 1$, $q = 1$, $\epsilon_k = 0$ and $P^{(\boldsymbol{G}_n)}_{B_{\ell_1}[\boldsymbol{w}_n, \rho]} = I$, where $I$ stands for the identity operator. ■

As it will be verified in Appendix C, the algorithm in (4) enjoys monotonicity, asymptotic optimality and strong convergence to a point that lies in the consensus subspace. The assumptions under which the previous hold are the following.

Assumptions. 

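(a) Define, $\forall n \in \mathbb{Z}_{\ge 0}$, $\Omega_n := B_{\ell_1}[\boldsymbol{w}_n, \rho] \cap \big( \bigcap_{j \in \mathcal{J}_n} \bigcap_{k \in \mathcal{N}} S_{k,j} \big)$. Assume that there exists $n_0 \in \mathbb{Z}_{\ge 0}$, such that $\Omega := \bigcap_{n \ge n_0} \Omega_n \neq \emptyset$.

(b) There exists $n_1 \in \mathbb{Z}_{\ge 0}$, such that $\boldsymbol{G}_n = \boldsymbol{G}_{n_1} =: \boldsymbol{G}$, $\forall n \ge n_1$. In other words, the update of the matrix $\boldsymbol{G}_n$ pauses after a finite number of iterations.¹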

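(c) Assume a sufficiently small $\varepsilon_1$, such that, $\forall k \in \mathcal{N}$, $\mu_{k,n}/\mathcal{M}_{k,n} \in [\varepsilon_1, 2 - \varepsilon_1]$.

(d) Assume that, $\forall k \in \mathcal{N}$, $\check{\omega}_k := \inf\{\omega_{k,j} : j \in \mathcal{J}_n, n \in \mathbb{Z}_{\ge 0}\} > 0$.

(e) Define $\mathfrak{C} := \mathcal{O} \cap \underline{\Omega}$, where the Cartesian product space $\underline{\Omega} := \Omega \times \cdots \times \Omega$ ($K$ factors). We assume that $\mathrm{ri}_{\mathcal{O}}(\mathfrak{C}) \neq \emptyset$, where this term stands for the relative interior of $\mathfrak{C}$ with respect to $\mathcal{O}$ (see Appendix A).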
Theorem 1: Under the previous assumptions, the following hold: 

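(1) Monotonicity. Under assumptions (a), (b), (c), it holds that, $\forall n \ge z_0$, $\forall \underline{\hat{\boldsymbol{h}}} \in \mathfrak{C}$, $\|\underline{\boldsymbol{h}}_{n+1} - \underline{\hat{\boldsymbol{h}}}\|_{\underline{\boldsymbol{G}}} \le \|\underline{\boldsymbol{h}}_n - \underline{\hat{\boldsymbol{h}}}\|_{\underline{\boldsymbol{G}}}$, where $z_0 := \max\{n_0, n_1\}$, $\underline{\boldsymbol{G}}$ is the $Km \times Km$ block-diagonal matrix, with definition $\underline{\boldsymbol{G}} := \mathrm{diag}\{\boldsymbol{G}, \ldots, \boldsymbol{G}\}$, and $\underline{\boldsymbol{h}}_n = [\boldsymbol{h}_{1,n}^T, \ldots, \boldsymbol{h}_{K,n}^T]^T \in \mathbb{R}^{Km}$, $\forall n \in \mathbb{Z}_{\ge 0}$.

(2) Asymptotic Optimality. If assumptions (a), (b), (c), (d) hold true, then $\lim_{n \to \infty} \max\{d(\boldsymbol{h}_{k,n+1}, S_{k,j}) : j \in \mathcal{J}_n\} = 0$, $\forall k \in \mathcal{N}$, where $d(\cdot, S_{k,j})$ denotes the distance of $\boldsymbol{h}_{k,n+1}$ from $S_{k,j}$ (see Appendix A). The previous implies that the distance of the estimates from the respective hyperslabs will tend asymptotically to zero.

(3) Asymptotic Consensus. Consider that assumptions (a), (b), (c), (d) hold. Then $\lim_{n \to \infty} \|\boldsymbol{h}_{k,n} - \boldsymbol{h}_{l,n}\| = 0$, $\forall k, l \in \mathcal{N}$.

(4) Strong Convergence. Under assumptions (a), (b), (c), (d), (e), it holds that $\lim_{n \to \infty} \underline{\boldsymbol{h}}_n = \underline{\hat{\boldsymbol{h}}}_*$, $\underline{\hat{\boldsymbol{h}}}_* \in \mathcal{O}$. So, the estimates for the whole network converge to a point that lies in the consensus subspace.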

Proof: The proof is given in Appendix C. ■ 



VI. Numerical Examples 

In this section, the performance of the proposed algorithm is validated within the system identification framework. Since the online algorithmic schemes proposed in the literature cover non-distributed learning scenarios, in the first experiment we compare the proposed algorithm against others in the context of a non-distributed system identification task. This essentially allows us to evaluate the variable metric projections scheme, since this is one of the contributions of this paper. More specifically, we compare the proposed algorithm with the Adaptive Projection based algorithm using Weighted $\ell_1$ Balls (APWL1) [3], with the Online Cyclic Coordinate Descent Time Weighted Lasso (OCCD-TWL) and the Online Cyclic Coordinate Descent Time and Norm Weighted Lasso (OCCD-TNWL), both proposed in [5], and with the LMS-based Sparse Adaptive Orthogonal Matching Pursuit (SpAdOMP) [26]. The unknown vector is of dimension $m = 512$ and the number of non-zero coefficients equals 20. Moreover, the input samples $\boldsymbol{u}_n = [u_n, \ldots, u_{n-m+1}]^T$ are drawn from a Gaussian distribution, with zero mean and standard deviation equal to 1. The noise process is Gaussian with variance equal to $\sigma^2 = 0.01$.
Finally, the adopted performance metric, which will be used, is the average Mean Square Deviation (MSD), given 

$^1$Notice that the matrix $G_n$ is constructed via $h_{k_{opt},n}$, hence $\forall n \geq n_1$, the variable metric projection is determined by $h_{k_{opt},n_1}$. In
practice, for sufficiently large $n_1$, the algorithm has converged and the fact that $G_n$ is not updated does not affect the performance of the
algorithm.
Fig. 4. Geometrical interpretation of the algorithm. The number of hyperslabs onto which $\phi_{k,n}$ is projected, using variable metric projections,
is $q = 2$. The result of these two projections, which are illustrated by the dash-dotted black line, is combined (red line) and the result is
projected (solid black line) onto the sparsity promoting weighted $\ell_1$ ball, in order to produce the next estimate.




Fig. 5. MSD for the experiment 1. 



by $\mathrm{MSD}(n) = \frac{1}{K}\sum_{k=1}^{K} \|h_{k,n} - h_*\|^2$, and the curves result from averaging over 100 realizations for smoothing
purposes.
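As a minimal sketch of the metric above, the network MSD at a single time instance can be computed as follows. The function name `msd` and the toy data are illustrative assumptions, not part of the original experiments; averaging over independent realizations would be done by the caller.

```python
def msd(estimates, h_star):
    """Average squared Euclidean deviation of the K node estimates from h_star.

    estimates: list of K lists (one estimate per node), h_star: list of length m.
    """
    total = 0.0
    for h_k in estimates:
        total += sum((a - b) ** 2 for a, b in zip(h_k, h_star))
    return total / len(estimates)


# Toy example (hypothetical data): K = 2 nodes, m = 3 coefficients.
h_star = [1.0, 0.0, 0.0]
estimates = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
print(msd(estimates, h_star))
```

In the non-distributed first experiment the same function applies with a single node, i.e., $K = 1$.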

In the projection-based algorithms, i.e., the proposed one and the APWL1, the number of hyperslabs used per
time update equals $q = 55$, the width of the hyperslabs equals $\epsilon = 1.3 \times \sigma$, and the step-size equals
$\mu_n = 0.2 \times \mathcal{M}_n$, where $\mathcal{M}_n$ is given in (5), and the node subscript is omitted. Moreover, for the weights we choose
$\omega_n = 1/q$. These choices are not necessarily optimal, albeit they lead to a good trade-off between the convergence
speed and the steady state error floor. The radius of the weighted $\ell_1$ ball equals $\rho = \|h_*\|_0$, the weights are
constructed according to the discussion in Section III and e„ = 10^^. Furthermore, the weighting matrix $G_n$ is
defined according to the strategy presented in Section III. Regarding the parameter $\alpha$, we observed that a value
close to 1 leads to a fast convergence speed but increases the steady state error floor, and vice versa. So, at the
beginning of the adaptation, we choose $\alpha = 0.99$ and every 250 time instances we set $\alpha = \alpha/2$. Finally, $w_n$ and
$G_n$ are updated at every time instance, i.e., $n' = 1$. In the OCCD-TWL and the OCCD-TNWL, the regularization
Fig. 6. MSD for the experiment 2.

parameter is chosen to be $\lambda_{\mathrm{TWL}} = \sqrt{2\sigma^2 n \log m}$ and $\lambda_{\mathrm{TNWL}} = \sqrt{2\sigma^2 n^{4/3} \log m}$, respectively, as advised in [5]. The
step size adopted in the Spadomp equals 0.2, because this choice gives a steady state error floor similar to
that of the projection-based algorithms.$^2$ The forgetting factor of OCCD-TWL, OCCD-TNWL and Spadomp
equals 1 since, in the specific example, the system under consideration does not change with time. From Fig. 5
it can be seen that the proposed algorithm converges faster than the APWL1 to the
common error floor. Moreover, the proposed algorithm outperforms the Spadomp, since it converges faster and its
steady state error floor is slightly better. We should point out that the complexity of the Spadomp is $O(m)$, which
implies that, for the previously mentioned choice of $q$, the proposed algorithm is of larger complexity. The
OCCD-TWL performs slightly better than the proposed algorithm, albeit at a
complexity of $O(m^2)$. Finally, the OCCD-TNWL outperforms the rest of the algorithms, at the
expense of a higher complexity, which is approximately twice that of OCCD-TWL.

In the second experiment, we consider a network consisting of $K = 10$ nodes, in which the nodes are tasked to
estimate an unknown parameter $h_*$ of dimension $m = 256$. The number of non-zero coefficients of the unknown
parameter equals 20 and each node has access to the measurements $(d_{k,n}, u_{k,n})$, where the regressors are defined
as in the previous experiment. The variance of the noise at each node is $\sigma_k^2 = 0.01\varsigma_k$, where $\varsigma_k \in [0.5, 1]$ follows
the uniform distribution. We compare the proposed algorithm with the distributed APWL1, i.e., the proposed algorithm with
$G_n = I_m$, and the distributed Lasso (Dlasso) [14]. The Dlasso is a batch algorithm, which implies that the data
have to be available before processing starts. So, here we assume that at every time instance at which a new
pair of data samples becomes available, the algorithm is re-initialized so as to solve a new optimization problem.
For the projection-based algorithms, $q = 20$ and the rest of the parameters are chosen as in the previous experiment.
Moreover, the combiners $c_{k,l}(n)$ are chosen with respect to the Metropolis rule. Finally, the regularization parameter
in the Dlasso is set via the distributed cross-validation procedure, which is proposed in [14]. From Fig. 6 we observe
that the Dlasso outperforms the projection-based algorithms and that the proposed algorithm converges faster than
APWL1. However, for $q = 20$, the complexity of the proposed algorithm is significantly lower than that of the

$^2$Extensive experiments have shown that the choice of a smaller step-size results in a slower convergence speed, without significant
improvement in the steady state error floor.
Fig. 7. MSD for the experiment 3.
Fig. 8. MSD for the experiment 4.

Dlasso, which requires the inversion of an $m \times m$ matrix at every time instance.
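The Metropolis rule mentioned above for the combiners can be sketched as follows. This is one common variant of the rule, in which the degree of a node counts the node itself; the exact variant used in the experiments is not stated here, so the function name `metropolis_weights`, the neighbor-set input format and the degree convention are assumptions made for illustration.

```python
def metropolis_weights(neighbors):
    """Build a K x K combination matrix from neighbor sets.

    neighbors[k] is the set of nodes linked to node k, excluding k itself.
    Assumed convention: c[k][l] = 1 / max(deg_k, deg_l) for linked nodes,
    where deg_k = |neighbors[k]| + 1, and c[k][k] absorbs the remainder.
    """
    K = len(neighbors)
    C = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in neighbors[k]:
            C[k][l] = 1.0 / max(len(neighbors[k]) + 1, len(neighbors[l]) + 1)
        # Diagonal entry makes the row sum to one (C[k][k] is still 0 here).
        C[k][k] = 1.0 - sum(C[k])
    return C


# Hypothetical 3-node line network: 0 - 1 - 2.
C = metropolis_weights([{1}, {0, 2}, {1}])
for row in C:
    print(row)
```

For an undirected network this construction yields a symmetric matrix whose rows sum to one.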

In the third experiment, we study the sensitivity of the proposed algorithm to the choice of the parameter $n'$, i.e., the
frequency at which $w_n$ and $G_n$ are updated. To this end, the parameters are the same as in the previous experiment,
but we assign different values to $n'$. Fig. 7 illustrates that the algorithm is relatively insensitive to the frequency of the
updates, since even in the case where $n' = 20$ the algorithm exhibits fast convergence speed. This is important, since
the robustness of the proposed scheme to the choice of the parameter $n'$ makes it suitable for distributed
learning.

In the fourth experiment, we validate the performance of the algorithm in a non-stationary environment. It is by 
now well established that a fast convergence speed does not necessarily imply a good tracking ability fJTl . More 
specifically, we consider that a sudden change in the unknown parameter takes place. So, until $h_*$ changes, the
parameters remain the same as in the second experiment, and after the sudden change, we have that $\|h_*\|_0 = 15$.
The radius of the weighted $\ell_1$ ball is set equal to 23, since through experiments we observed that the
performance of the proposed algorithmic scheme is relatively insensitive to the choice of $\rho$, as long as it remains larger
than $\|h_*\|_0$. Furthermore, we assume that the algorithm is able to monitor sudden changes of the orbit $(h_{k,n})_{n \in \mathbb{Z}_{\geq 0}}$, in
order to reset the value of $\alpha$ when the channel changes. To be more specific, we reset the value of $\alpha$ if the ratio
Fig. 9. Squared distance from the consensus subspace, for experiment 5.

$\|h_{k,n+1} - h_{k,n}\| / \|h_{k,n} - h_{k,n-1}\|$, $\forall k \in \mathcal{N}$, is greater than a threshold, which is chosen here to be equal to
10. This strategy is adopted since we observed that if the algorithm has converged, the previously mentioned ratio
takes values close to 1, whereas if an abrupt change takes place in the unknown parameter, then the value of the
ratio increases significantly. From Fig. 8 it can be observed that both projection-based algorithms enjoy good
tracking ability when a sudden change occurs. Moreover, as in the previous experiments, the proposed algorithm
converges faster than the APWL1 to a similar error floor.
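The change-detection test described above can be sketched per node as follows. The function name `should_reset_alpha` and the handling of a zero denominator are assumptions introduced for illustration; the threshold value 10 is the one quoted in the text.

```python
import math


def should_reset_alpha(h_next, h_curr, h_prev, threshold=10.0):
    """Return True if ||h_next - h_curr|| / ||h_curr - h_prev|| exceeds threshold.

    The three arguments are consecutive estimates of one node (lists of floats).
    """
    numerator = math.dist(h_next, h_curr)
    denominator = math.dist(h_curr, h_prev)
    if denominator == 0.0:
        # Assumption: make no decision when the previous step had zero length.
        return False
    return numerator / denominator > threshold


# Steady state: consecutive steps of similar size, ratio close to 1.
print(should_reset_alpha([0.2, 0.0], [0.1, 0.0], [0.0, 0.0]))
# Abrupt change: the latest step is far larger than the previous one.
print(should_reset_alpha([10.0, 0.0], [0.1, 0.0], [0.0, 0.0]))
```

When the function returns True, the caller would restore $\alpha$ to its initial value (0.99 in the experiments) and restart the halving schedule.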

Finally, in the fifth experiment, we study the robustness of the proposed scheme with respect to adopting
different strategies in order to construct $w_n$ and $G_n$. To this end, we consider the following strategies: a) the
previously mentioned quantities are constructed using the node with the smallest noise variance (Proposed a), b)
$w_n$ and $G_n$ are generated via the node with the largest variance (Proposed b) and c) $w_n$ and $G_n$ are constructed
locally at every node (Proposed c). Obviously, the latter one violates the theoretical assumption of having common
weights at all nodes. In order to verify whether the nodes reach consensus, we plot the squared distance of $\underline{h}_n$ from
the consensus subspace, i.e., $\|\underline{h}_n - P_{\mathcal{O}}(\underline{h}_n)\|^2$. As in the previous experiments, the curves result from averaging
over 100 independent experiments. From Fig. 9 it can be readily seen that the distance of $\underline{h}_n$ from the consensus
subspace decreases as time advances. It is interesting that even in the Proposed c case, where the assumption
under which asymptotic consensus is achieved is violated, the estimates for the whole network tend asymptotically
to the consensus subspace.
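A minimal sketch of the plotted quantity follows. It assumes that the consensus subspace consists of stacked vectors whose $K$ blocks are identical, so that the projection $P_{\mathcal{O}}$ replaces every block by the average of the node estimates; the function name is hypothetical.

```python
def squared_distance_from_consensus(estimates):
    """Squared Euclidean distance of the stacked estimates from the consensus subspace.

    estimates: list of K lists of length m (one estimate per node).
    Assumption: projecting onto the consensus subspace replaces each node's
    block by the network-wide average of the estimates.
    """
    K = len(estimates)
    m = len(estimates[0])
    mean = [sum(h[i] for h in estimates) / K for i in range(m)]
    return sum((h[i] - mean[i]) ** 2 for h in estimates for i in range(m))


# Two nodes that disagree in the first coefficient (toy data).
print(squared_distance_from_consensus([[1.0, 0.0], [3.0, 0.0]]))
# Two nodes in full agreement.
print(squared_distance_from_consensus([[2.0, 5.0], [2.0, 5.0]]))
```

The value is zero exactly when all nodes hold the same estimate.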

VII. Conclusions 

A sparsity-aware adaptive algorithm for distributed learning has been proposed. The algorithm builds upon
set-theoretic estimation arguments. In order to exploit the sparsity of the unknown vector, variable metric projections
onto the hyperslabs within which we seek a possible solution take place. Moreover, extra projections onto sparsity
promoting weighted $\ell_1$ balls are employed in order to further enhance the performance of the proposed scheme.
A full convergence analysis has been derived. Numerical examples, within the system identification task, demonstrate
the comparative performance of the proposed algorithm against other recently published algorithms.
Appendix A
Basic Concepts Of Convex Analysis

The stage of discussion will be $\mathbb{R}^m$ and the induced inner product, given a positive definite $m \times m$ matrix $V$,
is $\langle h_1, h_2 \rangle_V = h_1^T V h_2$. A set $C \subset \mathbb{R}^m$, for which it holds that $\forall h_1, h_2 \in C$ and $\forall t \in [0, 1]$, $t h_1 + (1 - t) h_2 \in C$,
is called convex. Moreover, a function $\Theta : \mathbb{R}^m \to \mathbb{R}$ will be called convex if $\forall h_1, h_2 \in \mathbb{R}^m$ and $\forall t \in [0, 1]$ the
inequality $\Theta(t h_1 + (1 - t) h_2) \leq t\Theta(h_1) + (1 - t)\Theta(h_2)$ is satisfied. Finally, the subdifferential of $\Theta$ at an arbitrary
point $h$ is defined as the set of all subgradients of $\Theta$ at $h$ ([38], [39]), i.e.,

$$\partial_{(V)}\Theta(h) := \{s \in \mathbb{R}^m : \Theta(h) + \langle x - h, s \rangle_V \leq \Theta(x), \ \forall x \in \mathbb{R}^m\}.$$

The distance of an arbitrary point $h$ from a closed non-empty convex set $C$, with respect to $V$, is given by the
distance function

$$d_{(V)}(\cdot, C) : \mathbb{R}^m \to [0, +\infty) : h \mapsto \inf\{\|h - x\|_V : x \in C\},$$

and if we let $V$ be the identity matrix, the Euclidean distance is obtained. This function is continuous, convex,
nonnegative and is equal to zero for every point that lies in $C$ [39]. Moreover, the projection mapping onto
$C$ is defined as $P_C^{(V)}(h) := \arg\min_{x \in C} \|h - x\|_V$, and as in the distance function, if $V = I_m$ the standard metric
projection is obtained.
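To make the projection mapping concrete, the following sketch evaluates $P_C^{(V)}$ for a hyperslab $C = \{x : |u^T x - d| \leq \epsilon\}$, the type of property set used in this paper. It is an illustration under stated assumptions: $V$ is taken to be diagonal with positive entries `v`, and the function name and the parameter names `u`, `d`, `eps` are introduced here for the example only.

```python
def project_onto_hyperslab(h, u, d, eps, v):
    """Projection of h onto {x : |u^T x - d| <= eps} in the metric V = diag(v).

    Minimizing (h - x)^T V (h - x) subject to u^T x = c gives
    x = h + V^{-1} u (c - u^T h) / (u^T V^{-1} u); c is the nearest slab boundary.
    """
    residual = sum(ui * hi for ui, hi in zip(u, h)) - d
    if abs(residual) <= eps:
        return list(h)  # h already lies in the hyperslab
    # Signed amount by which the residual exceeds the slab half-width.
    excess = residual - eps if residual > eps else residual + eps
    scale = excess / sum(ui * ui / vi for ui, vi in zip(u, v))
    return [hi - scale * ui / vi for hi, ui, vi in zip(h, u, v)]


# Identity metric: the standard metric projection.
print(project_onto_hyperslab([2.0, 0.0], [1.0, 0.0], 0.0, 0.5, [1.0, 1.0]))
# Non-identity metric: the correction is spread according to 1 / v_i.
print(project_onto_hyperslab([2.0, 2.0], [1.0, 1.0], 0.0, 0.0, [4.0, 1.0]))
```

In the second call the coordinate with the larger metric weight moves less, which is how a variable metric can steer the correction toward selected coefficients.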

Finally, the relative interior of a nonempty set $C$, with respect to another one, $S$, is defined as

$$\mathrm{ri}_S(C) = \{h \in C : \exists \epsilon_0 > 0 \text{ with } \emptyset \neq (B(h, \epsilon_0) \cap S) \subset C\},$$

where $B(h_0, \epsilon_0)$ is the open ball with center $h_0$ and radius equal to $\epsilon_0$, with definition
$B(h_0, \epsilon_0) := \{h \in \mathbb{R}^m : \|h - h_0\| < \epsilon_0\}$ (see for example [40]).

Appendix B 

Variable Metric Projection onto the Weighted $\ell_1$ Ball

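The variable metric projection of $h$ onto $B_{\ell_1}[w_n, \rho]$ is given by

$$\min_{x} \|h - x\|_{G_n}^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} w_{i,n}|x_i| \leq \rho,$$

where $x := [x_1, \ldots, x_m]^T$. However, $\|h - x\|_{G_n}^2 = \|G_n^{1/2}(h - x)\|^2 = \|G_n^{1/2}h - \xi\|^2$, where $\xi := G_n^{1/2}x$. Moreover,
$x = G_n^{-1/2}\xi \Leftrightarrow x_i = g_{i,n}^{-1/2}\xi_i$, $i = 1, \ldots, m$, where $\xi_i$ are the coefficients of $\xi$ and $g_{i,n}$ denotes the $i$-th diagonal
entry of $G_n$. From the previous, it holds that
$\sum_{i=1}^{m} w_{i,n}|x_i| = \sum_{i=1}^{m} w_{i,n} g_{i,n}^{-1/2}|\xi_i|$. Hence the initial optimization problem is equivalent to

$$\min_{\xi} \|G_n^{1/2}h - \xi\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} w_{i,n} g_{i,n}^{-1/2}|\xi_i| \leq \rho.$$

The solution of the previous optimization is the standard metric projection of $G_n^{1/2}h$ onto $B_{\ell_1}[G_n^{-1/2}w_n, \rho]$ and it can
be found in [3]. So, from the previous,

$$\xi_{opt} = P_{B_{\ell_1}[G_n^{-1/2}w_n, \rho]}(G_n^{1/2}h) \ \Rightarrow \ P^{(G_n)}_{B_{\ell_1}[w_n, \rho]}(h) = G_n^{-1/2} P_{B_{\ell_1}[G_n^{-1/2}w_n, \rho]}(G_n^{1/2}h).$$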
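The change of variables in this appendix can be sketched in code as follows. The inner routine is a generic sort-based Euclidean projection onto a weighted $\ell_1$ ball, used here as a stand-in and not necessarily the routine of the cited reference. The sketch assumes a diagonal $G_n$ with positive diagonal entries `g`, strictly positive weights `w` and $\rho > 0$; all function names are hypothetical.

```python
import math


def project_weighted_l1_ball(h, w, rho):
    """Euclidean projection of h onto {x : sum_i w_i |x_i| <= rho}, w_i > 0, rho > 0."""
    a = [abs(x) for x in h]
    if sum(wi * ai for wi, ai in zip(w, a)) <= rho:
        return list(h)  # already inside the ball
    # Visit coordinates in decreasing order of |h_i| / w_i; the active set is a prefix.
    order = sorted(range(len(h)), key=lambda i: a[i] / w[i], reverse=True)
    cum_wa = cum_ww = 0.0
    lam = 0.0
    for i in order:
        cum_wa += w[i] * a[i]
        cum_ww += w[i] ** 2
        candidate = (cum_wa - rho) / cum_ww
        if a[i] / w[i] > candidate:
            lam = candidate  # coordinate i remains non-zero
        else:
            break
    # Weighted soft thresholding with the multiplier found above.
    return [math.copysign(max(ai - lam * wi, 0.0), x) for ai, wi, x in zip(a, w, h)]


def variable_metric_projection(h, w, rho, g):
    """Projection onto the weighted l1 ball in the metric G = diag(g), via xi = G^{1/2} x."""
    s = [math.sqrt(gi) for gi in g]
    xi = project_weighted_l1_ball([si * hi for si, hi in zip(s, h)],
                                  [wi / si for wi, si in zip(w, s)], rho)
    return [x / si for x, si in zip(xi, s)]


print(variable_metric_projection([3.0, 1.0], [1.0, 1.0], 2.0, [1.0, 1.0]))
print(variable_metric_projection([3.0, -1.0], [1.0, 2.0], 2.0, [4.0, 1.0]))
```

With `g` equal to all ones the routine reduces to the standard metric projection, mirroring the remark in Appendix A for $V = I_m$.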



Appendix C 
Proof of Theorem 1 

A. Monotonicity 

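Lemma 1: Define the following non-negative loss functions, $\forall k \in \mathcal{N}$:

$$\forall n \in \mathbb{Z}_{\geq 0}, \ \forall h \in \mathbb{R}^m, \quad \Theta_{k,n}(h) := \begin{cases} \sum_{j \in \mathcal{I}_{k,n}} \frac{\omega_{k,j} d_{(G)}(\phi_{k,n}, S_{k,j})}{L_{k,n}} d_{(G)}(h, S_{k,j}), & \text{if } \mathcal{I}_{k,n} \neq \emptyset, \\ 0, & \text{if } \mathcal{I}_{k,n} = \emptyset, \end{cases} \quad (6)$$

where $\mathcal{I}_{k,n} := \{j \in \mathcal{J}_n : \phi_{k,n} \notin S_{k,j}\}$ and $L_{k,n} := \sum_{j \in \mathcal{J}_n} \omega_{k,j} d_{(G)}(\phi_{k,n}, S_{k,j})$. Then the recursion in (5) is equivalent to$^3$

$$\forall n \in \mathbb{Z}_{\geq 0}, \ \forall k \in \mathcal{N}, \quad h_{k,n+1} = \begin{cases} P^{(G)}_{B_{\ell_1}[w_n, \rho]}\left(\phi_{k,n} - \lambda_{k,n}\frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}\Theta'_{k,n}(\phi_{k,n})\right), & \text{if } \Theta'_{k,n}(\phi_{k,n}) \neq 0, \\ P^{(G)}_{B_{\ell_1}[w_n, \rho]}(\phi_{k,n}), & \text{otherwise}, \end{cases} \quad (7)$$

where $\Theta'_{k,n}(\phi_{k,n})$ is a subgradient of $\Theta_{k,n}$ at $\phi_{k,n}$ and $\lambda_{k,n} \in (0, 2)$.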

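Proof: First of all, notice that if $\mathcal{I}_{k,n} \neq \emptyset$, then there exists $j_0 \in \mathcal{J}_n$ such that $\phi_{k,n} \notin S_{k,j_0} \Leftrightarrow d_{(G)}(\phi_{k,n}, S_{k,j_0}) > 0$.
Hence, $L_{k,n} \geq \omega_{k,j_0} d_{(G)}(\phi_{k,n}, S_{k,j_0}) > 0$, which implies that the denominator in (7) is positive and the cost
function is well defined. Now, a subgradient of the distance function, i.e., $d_{(G)}(\cdot, S_{k,j})$, is the following [41]:

$$d'_{(G)}(h, S_{k,j}) = \begin{cases} \frac{h - P^{(G)}_{S_{k,j}}(h)}{d_{(G)}(h, S_{k,j})}, & \text{if } h \notin S_{k,j}, \\ 0, & \text{otherwise}. \end{cases} \quad (8)$$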



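Recalling basic properties of the subdifferential (see for example [39]), we have that

$$\partial\Theta_{k,n}(h) = \begin{cases} \sum_{j \in \mathcal{I}_{k,n}} \frac{\omega_{k,j} d_{(G)}(\phi_{k,n}, S_{k,j})}{L_{k,n}} \partial d_{(G)}(h, S_{k,j}), & \text{if } \mathcal{I}_{k,n} \neq \emptyset, \\ \{0\}, & \text{if } \mathcal{I}_{k,n} = \emptyset. \end{cases} \quad (9)$$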

$^3$The time dependence of $G_n$ is omitted for notational simplicity.
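So, combining (8), (9) and if $\mathcal{I}_{k,n} \neq \emptyset$ we have

$$\Theta'_{k,n}(\phi_{k,n}) = \frac{1}{L_{k,n}} \sum_{j \in \mathcal{I}_{k,n}} \omega_{k,j} d_{(G)}(\phi_{k,n}, S_{k,j}) \frac{\phi_{k,n} - P^{(G)}_{S_{k,j}}(\phi_{k,n})}{d_{(G)}(\phi_{k,n}, S_{k,j})} = \frac{1}{L_{k,n}} \sum_{j \in \mathcal{I}_{k,n}} \omega_{k,j}\left(\phi_{k,n} - P^{(G)}_{S_{k,j}}(\phi_{k,n})\right). \quad (10)$$

Nevertheless, since $\mathcal{I}_{k,n} \neq \emptyset$, then there exists $j_0 \in \mathcal{J}_n$ such that $\phi_{k,n} \notin S_{k,j_0} \Leftrightarrow P^{(G)}_{S_{k,j_0}}(\phi_{k,n}) \neq \phi_{k,n}$. So, if
$\mathcal{I}_{k,n} \neq \emptyset$ then $\Theta'_{k,n}(\phi_{k,n}) \neq 0$. Following similar steps as in [3], it can be proved that $\forall n \geq z_0$, $\forall j \in \mathcal{J}_n$, $\forall k \in \mathcal{N}$,
$\Theta'_{k,n}(\phi_{k,n}) = 0 \Leftrightarrow \phi_{k,n} = \sum_{j \in \mathcal{J}_n} \omega_{k,j} P^{(G)}_{S_{k,j}}(\phi_{k,n})$. From this fact, if we define $\mu_{k,n} := \mathcal{M}_{k,n}\lambda_{k,n}$, and if we
substitute (10) in (7) the lemma is proved. ■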

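Claim 2: It holds that $\|P\underline{h} - \underline{\hat{h}}\|_{\underline{G}} \leq \|\underline{h} - \underline{\hat{h}}\|_{\underline{G}}$, $\forall \underline{\hat{h}} \in \mathcal{O}$, $\forall \underline{h} \in \mathbb{R}^{Km}$, where $P$ is a $Km \times Km$ consensus
matrix with $\|P\| = 1$.

Proof: From the definition of $\|\cdot\|_{\underline{G}}$ it can be readily seen that $\|P\underline{h} - \underline{\hat{h}}\|_{\underline{G}} = \|\underline{G}^{1/2}(P\underline{h} - \underline{\hat{h}})\| = \|\underline{G}^{1/2}P(\underline{h} - \underline{\hat{h}})\|$,
where this holds since $\underline{\hat{h}} \in \mathcal{O}$. Moreover, $\underline{h} = [h_1^T, \ldots, h_K^T]^T$, $h_k \in \mathbb{R}^m$, $k \in \mathcal{N}$, and
$\underline{\hat{h}} \in \mathcal{O} \Leftrightarrow \underline{\hat{h}} = [\hat{h}^T, \ldots, \hat{h}^T]^T$, $\hat{h} \in \mathbb{R}^m$. Recalling the definition of the consensus matrix, with coefficients $c_{k,l}$, $k, l \in \mathcal{N}$, we have the
following

$$\|\underline{G}^{1/2}P(\underline{h} - \underline{\hat{h}})\| = \left\| \begin{bmatrix} \sum_{l \in \mathcal{N}_1} c_{1,l} G^{1/2}(h_l - \hat{h}) \\ \vdots \\ \sum_{l \in \mathcal{N}_K} c_{K,l} G^{1/2}(h_l - \hat{h}) \end{bmatrix} \right\| = \left\| P \begin{bmatrix} G^{1/2}(h_1 - \hat{h}) \\ \vdots \\ G^{1/2}(h_K - \hat{h}) \end{bmatrix} \right\| \leq \|P\| \left\| \begin{bmatrix} G^{1/2}(h_1 - \hat{h}) \\ \vdots \\ G^{1/2}(h_K - \hat{h}) \end{bmatrix} \right\| = \|\underline{h} - \underline{\hat{h}}\|_{\underline{G}}. \quad (11)$$

From (11), our claim is proved. ■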
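First of all, given a convex function $\Theta : \mathbb{R}^m \to \mathbb{R}$, with non-empty level set, where the level set is defined as
$\mathrm{lev}_{\leq 0}\Theta := \{h \in \mathbb{R}^m : \Theta(h) \leq 0\}$, let us define the subgradient projection mapping $T^{(G)}_{\Theta} : \mathbb{R}^m \to \mathbb{R}^m$ as follows:

$$T^{(G)}_{\Theta}(h) := \begin{cases} h - \frac{\Theta(h)}{\|\Theta'(h)\|_G^2}\Theta'(h), & h \notin \mathrm{lev}_{\leq 0}\Theta, \\ h, & h \in \mathrm{lev}_{\leq 0}\Theta, \end{cases}$$

where $\Theta'(h)$ is any subgradient of $\Theta$ at $h$. Similarly, we define the relaxed subgradient projection mapping,
$T^{(G)}_{\Theta,\lambda} := I + \lambda(T^{(G)}_{\Theta} - I)$, $\lambda \in (0, 2)$, where $I$ is the identity mapping.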

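Now, given a non-empty closed convex set, say $C \subset \mathbb{R}^m$, and a convex function $\Theta : \mathbb{R}^m \to \mathbb{R}$, such that
$C \cap \mathrm{lev}_{\leq 0}\Theta \neq \emptyset$, it holds that HI:

$$\forall h \in \mathbb{R}^m, \ \forall \hat{h} \in C \cap \mathrm{lev}_{\leq 0}\Theta : \quad \frac{2 - \lambda}{2}\|h - P^{(G)}_C T^{(G)}_{\Theta,\lambda}(h)\|_G^2 \leq \|h - \hat{h}\|_G^2 - \|P^{(G)}_C T^{(G)}_{\Theta,\lambda}(h) - \hat{h}\|_G^2. \quad (12)$$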

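Following similar steps as in [3], it can be proved that $\forall n \geq z_0$, $\phi_{k,n} \in \mathrm{lev}_{\leq 0}\Theta_{k,n} \Leftrightarrow \mathcal{I}_{k,n} = \emptyset$ and $\forall n \geq z_0$,
$\phi_{k,n} \notin \mathrm{lev}_{\leq 0}\Theta_{k,n} \Leftrightarrow \mathcal{I}_{k,n} \neq \emptyset$. Moreover, $\mathrm{lev}_{\leq 0}\Theta_{k,n} = \bigcap_{j \in \mathcal{I}_{k,n}} S_{k,j} \supseteq \Omega \neq \emptyset$. Recall the definition of the
relaxed projection mapping; it can be readily seen that $h_{k,n+1} = P^{(G)}_{B_{\ell_1}[w_n, \rho]} T^{(G)}_{\Theta_{k,n},\lambda_{k,n}}(\phi_{k,n})$. Exploiting this fact,
under Assumptions (a), (b), and (12) we have that

$$\forall n \geq z_0, \ \forall k \in \mathcal{N}, \ \forall \hat{h} \in \Omega : \quad \frac{2 - \lambda_{k,n}}{2}\|\phi_{k,n} - h_{k,n+1}\|_G^2 \leq \|\phi_{k,n} - \hat{h}\|_G^2 - \|h_{k,n+1} - \hat{h}\|_G^2. \quad (13)$$

Recalling the definitions $\underline{h}_n = [h_{1,n}^T, \ldots, h_{K,n}^T]^T \in \mathbb{R}^{Km}$, $P_n\underline{h}_n = [\phi_{1,n}^T, \ldots, \phi_{K,n}^T]^T \in \mathbb{R}^{Km}$, and $\mathcal{O}$, we have, for
every $\underline{\hat{h}} = [\hat{h}^T, \ldots, \hat{h}^T]^T$ with $\hat{h} \in \Omega$,

$$\min_{k}\left\{\frac{2 - \lambda_{k,n}}{2}\right\}\|P_n\underline{h}_n - \underline{h}_{n+1}\|_{\underline{G}}^2 \leq \|P_n\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2. \quad (14)$$

Nevertheless, from Claim 2, the previous inequality can be rewritten

$$0 \leq \|P_n\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2 \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2.$$

Hence,

$$\forall n \geq z_0 : \quad \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}} \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}, \quad (15)$$

which completes our proof. ■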
B. Asymptotic optimality 

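A well known property of the projection operator (see for example BTI ) is the non-expansivity, i.e., given a
non-empty closed convex set $C$, $\|P^{(G)}_C(h_1) - P^{(G)}_C(h_2)\|_G \leq \|h_1 - h_2\|_G$, $\forall h_1, h_2 \in \mathbb{R}^m$. Recall the definition of the algorithm
given in (7). Then, $\forall k \in \mathcal{N}$, $\forall n \geq z_0$, $\forall \hat{h} \in \Omega$, we have

$$\begin{aligned} \|h_{k,n+1} - \hat{h}\|_G^2 &= \left\| P^{(G)}_{B_{\ell_1}[w_n,\rho]}\left(\phi_{k,n} - \lambda_{k,n}\frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}\Theta'_{k,n}(\phi_{k,n})\right) - \hat{h} \right\|_G^2 \\ &= \left\| P^{(G)}_{B_{\ell_1}[w_n,\rho]}\left(\phi_{k,n} - \lambda_{k,n}\frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}\Theta'_{k,n}(\phi_{k,n})\right) - P^{(G)}_{B_{\ell_1}[w_n,\rho]}(\hat{h}) \right\|_G^2 \\ &\leq \left\| \phi_{k,n} - \lambda_{k,n}\frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}\Theta'_{k,n}(\phi_{k,n}) - \hat{h} \right\|_G^2, \end{aligned} \quad (16)$$

where the equality in the second line holds since, by definition, $\hat{h} \in \Omega \subset B_{\ell_1}[w_n, \rho]$ and the inequality follows from the
non-expansivity of the projection operator. Assuming that $\Theta'_{k,n}(\phi_{k,n}) \neq 0$, $\forall k \in \mathcal{N}$, and rewriting (16) for all the
nodes of the network we have

$$\begin{aligned} \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2 &\leq \left\| \begin{bmatrix} \phi_{1,n} - \lambda_{1,n}\frac{\Theta_{1,n}(\phi_{1,n})}{\|\Theta'_{1,n}(\phi_{1,n})\|_G^2}\Theta'_{1,n}(\phi_{1,n}) - \hat{h} \\ \vdots \\ \phi_{K,n} - \lambda_{K,n}\frac{\Theta_{K,n}(\phi_{K,n})}{\|\Theta'_{K,n}(\phi_{K,n})\|_G^2}\Theta'_{K,n}(\phi_{K,n}) - \hat{h} \end{bmatrix} \right\|_{\underline{G}}^2 \\ &= \|P_n\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 + \sum_{k \in \mathcal{N}} \lambda_{k,n}^2\frac{\Theta_{k,n}^2(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2} - 2\sum_{k \in \mathcal{N}} \lambda_{k,n}\frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}\langle \phi_{k,n} - \hat{h}, \Theta'_{k,n}(\phi_{k,n}) \rangle_G. \end{aligned} \quad (17)$$

Nevertheless,

$$\|P_n\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}} \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}. \quad (18)$$

From the definition of the subgradient, we have

$$\langle \phi_{k,n} - \hat{h}, \Theta'_{k,n}(\phi_{k,n}) \rangle_G \geq \Theta_{k,n}(\phi_{k,n}) - \Theta_{k,n}(\hat{h}) = \Theta_{k,n}(\phi_{k,n}), \quad (19)$$

where the last equation holds due to the fact that $\hat{h} \in \Omega \Rightarrow \Theta_{k,n}(\hat{h}) = 0$. Taking (18) and (19) into consideration,
we obtain

$$\|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2 \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \sum_{k \in \mathcal{N}} \lambda_{k,n}(2 - \lambda_{k,n})\frac{\Theta_{k,n}^2(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2}. \quad (20)$$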



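Here, notice that the sequence $(\|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2)_{n \geq z_0}$ is bounded and monotonically non-increasing, hence it converges. The latter fact
implies that

$$\lim_{n\to\infty}\left(\|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2\right) = 0. \quad (21)$$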
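Under Assumption (c), (20) can be rewritten

$$c\sum_{k \in \mathcal{N}}\frac{\Theta_{k,n}^2(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2} \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2, \quad (22)$$

where $c > 0$ is a constant lower bound of $\lambda_{k,n}(2 - \lambda_{k,n})$, whose existence is guaranteed by Assumption (c).
Taking limits in (22) and recalling (21) we have that

$$\lim_{n\to\infty}\frac{\Theta_{k,n}^2(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G^2} = 0, \ \forall k \in \mathcal{N}.$$

If we follow similar steps as in [3], it can be verified that $\forall n \in \mathbb{Z}_{\geq 0}$, $\forall k \in \mathcal{N}$, $\forall h \in \mathbb{R}^m$ : $\|\Theta'_{k,n}(h)\|_G \leq 1$. So, if
$\Theta'_{k,n}(\phi_{k,n}) \neq 0$,

$$\Theta_{k,n}(\phi_{k,n}) \leq \frac{\Theta_{k,n}(\phi_{k,n})}{\|\Theta'_{k,n}(\phi_{k,n})\|_G} \to 0, \ n \to \infty. \quad (23)$$

Obviously, recalling the previous discussion, $\Theta'_{k,n}(\phi_{k,n}) = 0 \Leftrightarrow \Theta_{k,n}(\phi_{k,n}) = 0$, $\forall n \geq z_0$. Combining this fact
together with (23), we have that

$$\forall k \in \mathcal{N}, \ \lim_{n\to\infty}\Theta_{k,n}(\phi_{k,n}) = 0. \quad (24)$$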

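Now, following similar steps as in [3], it can be shown that there exists $D > 0$ such that $L_{k,n} \leq D$, $\forall k \in \mathcal{N}$, $\forall n \in \mathbb{Z}_{\geq 0}$.
From the definition of $\Theta_{k,n}$, and under Assumption (d), we have $\forall k \in \mathcal{N}$, $\forall j \in \mathcal{J}_n$,

$$\Theta_{k,n}(\phi_{k,n}) = \frac{1}{L_{k,n}}\sum_{i \in \mathcal{I}_{k,n}}\omega_{k,i} d_{(G)}^2(\phi_{k,n}, S_{k,i}) \geq \frac{\omega_{k,j}}{D} d_{(G)}^2(\phi_{k,n}, S_{k,j}).$$

Taking limits in the previous inequality, we obtain that

$$\lim_{n\to\infty}\max\{d_{(G)}(\phi_{k,n}, S_{k,j}) : j \in \mathcal{J}_n\} = 0. \quad (25)$$

Combining (14) with the result of Claim 2, we have, $\forall n \geq z_0$ and for every $\underline{\hat{h}} = [\hat{h}^T, \ldots, \hat{h}^T]^T$ with $\hat{h} \in \Omega$,

$$\min_{k}\left\{\frac{2 - \lambda_{k,n}}{2}\right\}\|P_n\underline{h}_n - \underline{h}_{n+1}\|_{\underline{G}}^2 \leq \|\underline{h}_n - \underline{\hat{h}}\|_{\underline{G}}^2 - \|\underline{h}_{n+1} - \underline{\hat{h}}\|_{\underline{G}}^2. \quad (26)$$

Taking limits in (26) and recalling (21) gives us

$$\lim_{n\to\infty}\|P_n\underline{h}_n - \underline{h}_{n+1}\|_{\underline{G}} = 0 \ \Rightarrow \ \lim_{n\to\infty}\|\phi_{k,n} - h_{k,n+1}\|_G = 0, \ \forall k \in \mathcal{N}. \quad (27)$$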



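Fix an arbitrary point $v \in S_{k,j}$, $\forall k \in \mathcal{N}$, $\forall j \in \mathcal{J}_n$. Then from the triangle inequality we have

$$\begin{aligned} \|h_{k,n+1} - v\|_G &\leq \|h_{k,n+1} - \phi_{k,n}\|_G + \|\phi_{k,n} - v\|_G \ \Rightarrow \\ \inf_{v \in S_{k,j}}\|h_{k,n+1} - v\|_G &\leq \|h_{k,n+1} - \phi_{k,n}\|_G + \inf_{v \in S_{k,j}}\|\phi_{k,n} - v\|_G \ \Rightarrow \\ d_{(G)}(h_{k,n+1}, S_{k,j}) &\leq \|h_{k,n+1} - \phi_{k,n}\|_G + d_{(G)}(\phi_{k,n}, S_{k,j}). \end{aligned} \quad (28)$$

If we take limits in (28), from (25) and (27), it can be seen that

$$\lim_{n\to\infty} d_{(G)}(h_{k,n+1}, S_{k,j}) = 0, \ \forall k \in \mathcal{N}, \ \forall j \in \mathcal{J}_n \ \Rightarrow \ \lim_{n\to\infty}\max\{d_{(G)}(h_{k,n+1}, S_{k,j}) : j \in \mathcal{J}_n\} = 0, \ \forall k \in \mathcal{N}. \quad (29)$$

The definitions of the distance function and the projection operator yield

$$d(h_{k,n+1}, S_{k,j}) = \|h_{k,n+1} - P_{S_{k,j}}(h_{k,n+1})\| \leq \|h_{k,n+1} - P^{(G)}_{S_{k,j}}(h_{k,n+1})\|. \quad (30)$$

Nevertheless, the Rayleigh-Ritz theorem implies |02l| $\forall h \in \mathbb{R}^m$ : $\|h\| \leq \tau_{\min}^{-1/2}\|h\|_G$, where $\tau_{\min}$ is the smallest
eigenvalue of $G$. Combining this fact as well as (30) we obtain

$$d(h_{k,n+1}, S_{k,j}) \leq \|h_{k,n+1} - P^{(G)}_{S_{k,j}}(h_{k,n+1})\| \leq \tau_{\min}^{-1/2}\|h_{k,n+1} - P^{(G)}_{S_{k,j}}(h_{k,n+1})\|_G \to 0, \ n \to \infty, \ \forall k \in \mathcal{N}, \quad (31)$$

where the limit holds from (29). From the previous, it is not difficult to obtain that

$$\lim_{n\to\infty}\max\{d(h_{k,n+1}, S_{k,j}) : j \in \mathcal{J}_n\} = 0,$$

which completes our proof. ■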
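The norm comparison used in (31) can be checked numerically with a short sketch. It assumes a diagonal positive definite $G$ with diagonal entries `g` (so that the smallest eigenvalue is simply the smallest diagonal entry); the helper names and the random test vectors are illustrative only.

```python
import math
import random


def euclidean_norm(h):
    """Standard Euclidean norm ||h||."""
    return math.sqrt(sum(x * x for x in h))


def weighted_norm(h, g):
    """Norm ||h||_G = sqrt(h^T G h) for the diagonal matrix G = diag(g)."""
    return math.sqrt(sum(gi * x * x for gi, x in zip(g, h)))


random.seed(0)
g = [0.5, 2.0, 4.0]   # hypothetical diagonal of G
tau_min = min(g)      # smallest eigenvalue of a diagonal matrix
worst_gap = float("inf")
for _ in range(1000):
    h = [random.uniform(-1.0, 1.0) for _ in g]
    # gap >= 0 means ||h|| <= tau_min^{-1/2} ||h||_G holds for this h.
    gap = weighted_norm(h, g) / math.sqrt(tau_min) - euclidean_norm(h)
    worst_gap = min(worst_gap, gap)
print(worst_gap >= -1e-12)
```

The bound is tight for vectors aligned with the eigenvector of the smallest eigenvalue, here the first coordinate axis.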



C. Asymptotic Consensus 

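In [35] it has been proved that the algorithmic scheme achieves asymptotic consensus, i.e., $\|h_{k,n} - h_{l,n}\| \to 0$,
$n \to \infty$, $\forall k, l \in \mathcal{N}$, if and only if

$$\lim_{n\to\infty}\|\underline{h}_n - P_{\mathcal{O}}(\underline{h}_n)\| = 0. \quad (32)$$

Let Assumptions (a), (b), (c), (d) hold true. We define the following quantity

$$\underline{\epsilon}_n := \underline{h}_{n+1} - P_n\underline{h}_n. \quad (33)$$

Obviously from (27)

$$\lim_{n\to\infty}\underline{\epsilon}_n = 0. \quad (34)$$

Now, if we rearrange the terms in (33) and if we iterate the resulting equation, we have:

$$\begin{aligned} \underline{h}_{n+1} &= P_n\underline{h}_n + \underline{\epsilon}_n \\ &= P_nP_{n-1}\underline{h}_{n-1} + P_n\underline{\epsilon}_{n-1} + \underline{\epsilon}_n = \cdots \\ &= \prod_{l=0}^{n}P_{n-l}\,\underline{h}_0 + \sum_{j=1}^{n}\prod_{l=0}^{n-j}P_{n-l}\,\underline{\epsilon}_{j-1} + \underline{\epsilon}_n. \end{aligned}$$

If we left-multiply the previous equation by $(I_{Km} - BB^T)$, where $I_{Km}$ is the $Km \times Km$ identity matrix, and
follow similar steps as in [35, Lemma 2] it can be verified that $\lim_{n\to\infty}\|(I_{Km} - BB^T)\underline{h}_{n+1}\| = 0$, which completes
our proof. ■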

D. Strong Convergence 

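We will prove that, under assumptions (a), (b), (c), (d), (e), $\lim_{n\to\infty}\underline{h}_n = \underline{\hat{h}}_*$, $\underline{\hat{h}}_* \in \mathcal{O}$. Recall that the projection
operator of an arbitrary vector $\underline{h} \in \mathbb{R}^{Km}$ onto the consensus subspace equals $P_{\mathcal{O}}(\underline{h}) = BB^T\underline{h}$, $\forall \underline{h} \in \mathbb{R}^{Km}$.
Taking into consideration Assumption (e) together with (15), from [16, Lemma 1] we have that there exists $\underline{\hat{h}}_* \in \mathcal{O}$
such that

$$\lim_{n\to\infty}P_{\mathcal{O}}(\underline{h}_n) = \underline{\hat{h}}_*. \quad (35)$$

Now, exploiting the triangle inequality we have that

$$\|\underline{h}_n - \underline{\hat{h}}_*\| \leq \|\underline{h}_n - P_{\mathcal{O}}(\underline{h}_n)\| + \|\underline{\hat{h}}_* - P_{\mathcal{O}}(\underline{h}_n)\| \to 0, \ n \to \infty, \quad (36)$$

where this limit holds from (32) and (35). The proof is complete since (36) implies that $\lim_{n\to\infty}\underline{h}_n = \underline{\hat{h}}_*$. ■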